Climate change is an ongoing problem, and carbon dioxide emissions are one of its main drivers. The problem I want to solve: is it possible to predict the amount of carbon dioxide a state emits from a set of predictors? I want to learn which variables help predict carbon dioxide emissions.
This data set has 50 rows and 19 variables. All of the data is from 2020. For this analysis, I will not be using ‘State’ as a variable. I also won’t be using ‘CoalTrans’ as a variable because it is 0 for all states. The data set I’m using is a combination of different data sets. Since there are only 50 rows of data, I will not do training & validation sets for classification models.
VARIABLES TO PREDICT WITH
VARIABLES WE WANT TO PREDICT
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| TotCO2 | 5.40 | 39.42 | 67.05 | 91.80 | 105.35 | 624.00 |
| HighLow | 0.00 | 0.00 | 0.00 | 0.28 | 1.00 | 1.00 |
| TotEnergy | 125.7 | 675.2 | 1479.7 | 1854.6 | 2214.6 | 13480.8 |
| Coal | 0.0 | 19.3 | 146.1 | 183.7 | 248.1 | 872.8 |
| NaturalGas | 0.2 | 265.5 | 364.3 | 629.9 | 739.5 | 4708.4 |
| Petroleum | 68.4 | 224.0 | 444.9 | 646.7 | 682.0 | 6185.8 |
| TotFF | 84.0 | 619.7 | 1009.7 | 1460.2 | 1549.6 | 11767.1 |
| NuclearElectricPower | 0.0 | 0.0 | 89.6 | 165.0 | 300.2 | 1046.8 |
| RenewableEnergy | 7.70 | 84.28 | 170.80 | 228.16 | 280.07 | 1150.20 |
| Residential | 36.8 | 137.9 | 316.1 | 409.7 | 512.6 | 1744.1 |
| Commercial | 25.3 | 105.6 | 234.2 | 332.9 | 403.8 | 1630.5 |
| Industrial | 17.7 | 180.1 | 379.0 | 625.4 | 573.8 | 7265.9 |
| Transportation | 39.0 | 175.3 | 382.5 | 486.6 | 598.8 | 2840.2 |
| Pop | 577719 | 1871866 | 4585405 | 6622169 | 7576690 | 39576757 |
| CoalTrans | 0 | 0 | 0 | 0 | 0 | 0 |
| NaturalGasTrans | 0.00 | 5.30 | 11.65 | 21.90 | 24.85 | 196.10 |
| PetroleumTrans | 39.0 | 169.2 | 360.4 | 463.6 | 553.3 | 2642.5 |
| TaxCredit | 0.00 | 0.00 | 0.00 | 0.34 | 1.00 | 1.00 |

State: character column, length 50.
This histogram shows TotCO2, the total CO2 emissions of each state. Most states fall between 0 and 200 million metric tons of carbon dioxide.
We can see that the majority of states are considered “Low Emission” states, meaning that they produce less than 100 million metric tons of carbon dioxide. We can also see that “High Emission” states have a larger range of values and include 2 outliers.
I will be using prediction/estimation modeling techniques to predict the amount of CO2 emissions for a state. The first technique I will use is a Multiple Linear Regression. Then I will run a Decision Tree Model and compare them.
For this model my predictors were: Pop, Coal, NaturalGas, Petroleum, NuclearElectricPower, RenewableEnergy, NaturalGasTrans, PetroleumTrans, & TaxCredit.
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| Coal | 0.097 | 0.001 | 72.110 | 0.000 |
| NaturalGas | 0.052 | 0.001 | 37.045 | 0.000 |
| Petroleum | 0.028 | 0.001 | 18.949 | 0.000 |
| PetroleumTrans | 0.032 | 0.004 | 7.939 | 0.000 |
| Pop | 0.000 | 0.000 | 4.682 | 0.000 |
| NaturalGasTrans | 0.044 | 0.016 | 2.755 | 0.009 |
| (Intercept) | 1.143 | 0.452 | 2.527 | 0.016 |
| RenewableEnergy | 0.004 | 0.001 | 2.347 | 0.024 |
| NuclearElectricPower | 0.001 | 0.001 | 0.661 | 0.512 |
| TaxCredit | -0.375 | 0.572 | -0.655 | 0.516 |
After examining this model, NuclearElectricPower and TaxCredit are not significant predictors of CO2 (their p-values are above 0.05), so a pruned version of the model is created by removing predictors that are not significant.
For this analysis we will use a pruned Multiple Linear Regression Model. I also removed predictors that I didn’t think were important in predicting CO2 emissions. The transportation predictors NaturalGasTrans and PetroleumTrans were significant (based on their p-values), so I decided to use the Transportation variable as a predictor because it combines CoalTrans, NaturalGasTrans, & PetroleumTrans. These are the predictors in the final model: Coal, NaturalGas, Petroleum, & Transportation.
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| Coal | 0.094 | 0.001 | 63.235 | 0.000 |
| NaturalGas | 0.053 | 0.001 | 43.755 | 0.000 |
| Transportation | 0.050 | 0.001 | 35.613 | 0.000 |
| Petroleum | 0.024 | 0.001 | 21.954 | 0.000 |
| (Intercept) | 1.005 | 0.429 | 2.344 | 0.024 |
[Figure: DecisionTree3.png]
I will be using classification modeling techniques to predict if a state would be considered a “high emissions” or “low emissions” state. The first classification technique I will use is a Nominal Logistic Model. Then I will run a Boosted Tree Model and compare the two.
[Figure: Logistic3.png]
A multiple linear regression is the best model.
The predictors used were: Coal, NaturalGas, Petroleum, & Transportation.
The best model to use is a nominal logistic model but a boosted tree model also works very well.
The variables used were: Coal, NaturalGas, Petroleum, Pop, & Transportation.
Overall, all of the models created were useful and could predict/classify a state’s CO2 emissions well. The best models were the Multiple Linear Regression Model 2 and the Nominal Logistic Regression Model. The Multiple Linear Regression Model 1 was also good, but was more complicated. The Boosted Tree Model was almost the same as the Nominal Logistic Model. One thing that surprised me was that I didn’t need/use all of the variables I began with.
---
title: "Carbon Dioxide Emissions Data Analysis Project"
output:
  flexdashboard::flex_dashboard: # this tells R to create a dashboard; make sure every part
    vertical_layout: scroll      # works before knitting the file together into a dashboard
    source_code: embed
---
-----------------------------------------------------------------------
```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```
```{r load_data}
df <- read_csv("INFO 3200 Official Project Data.csv")
```
Introduction & Data Exploration {data-orientation=rows}
=======================================================================
Row {data-height=800}
-----------------------------------------------------------------------
### The Problem & Data Collection
#### The Problem
Climate change is an ongoing problem, and carbon dioxide emissions are one of its main drivers. The problem I want to solve: is it possible to predict the amount of carbon dioxide a state emits from a set of predictors? I want to learn which variables help predict carbon dioxide emissions.
#### The Questions
1. What variables can I use to predict a state's carbon dioxide emissions?
2. What is the best model to predict a state's carbon dioxide emissions?
3. What variables can I use to predict whether a state would be considered a "high emissions" or "low emissions" state based on my current variables?
4. What is the best model to use to answer question 3?
#### The Data
This data set has 50 rows and 19 variables. All of the data is from 2020. For this analysis, I will not be using 'State' as a variable. I also won't be using 'CoalTrans' as a variable because it is 0 for all states. The data set I'm using is a combination of different data sets. Since there are only 50 rows of data, I will not do training & validation sets for classification models.
#### Data Sources
* 2020 Carbon Dioxide Emissions by State: https://www.eia.gov/state/rankings/#/series/226.
* EV Tax Credit: https://www.energysage.com/electric-vehicles/costs-and-benefits-evs/ev-tax-credits/
* State Energy Consumption Estimates (1960-2020): https://www.eia.gov/state/seds/sep_use/notes/use_print.pdf
* Transportation Sector Energy Consumption: https://www.eia.gov/state/seds/data.php?incfile=/state/seds/sep_sum/html/sum_btu_tra.html&sid=US
* US Census 2020 Population dataset: https://www.eia.gov/state/rankings/#/series/226
### The Data
VARIABLES TO PREDICT WITH
* *TotEnergy*: total energy consumed (in trillion btu)
* *Coal*: energy consumed from coal (in trillion btu)
* *NaturalGas*: energy consumed from natural gas (in trillion btu)
* *Petroleum*: energy consumed from petroleum (in trillion btu)
* *TotFF*: total energy consumed from fossil fuels, sum of Coal, NaturalGas, & Petroleum (in trillion btu)
* *NuclearElectricPower*: energy consumed from nuclear electric power
* *RenewableEnergy*: energy consumed from renewable energy sources (in trillion btu)
* *Residential*: energy consumed by the residential sector (in trillion btu)
* *Commercial*: energy consumed by the commercial sector (in trillion btu)
* *Industrial*: energy consumed by the industrial sector (in trillion btu)
* *Transportation*: energy consumed by the transportation sector (in trillion btu)
* *Pop*: population of a state
* *CoalTrans*: energy from coal used for transportation (in trillion btu)
* *NaturalGasTrans*: energy from natural gas used for transportation (in trillion btu)
* *PetroleumTrans*: energy from petroleum used for transportation (in trillion btu)
* *TaxCredit*: EV tax credit dummy variable (= 1 if tax credit; 0 otherwise)
VARIABLES WE WANT TO PREDICT
* *TotCO2*: total CO2 emissions of a state (in million metric tons)
* *HighLow*: CO2 emissions > 100 million metric tons coded as 1, lower coded as 0
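
As a minimal sketch of the HighLow coding described above, the rule can be written as a single comparison. The helper name `code_high_low` is hypothetical and not part of the original analysis, and the example values are the minimum, median, 3rd quartile, and maximum of TotCO2 from the summary statistics.

```r
# Hypothetical helper: code emissions above 100 million metric tons as 1, otherwise 0.
code_high_low <- function(tot_co2, threshold = 100) {
  as.integer(tot_co2 > threshold)
}

# Illustrative values: min, median, 3rd quartile, and max of TotCO2 from summary(df).
co2 <- c(5.40, 67.05, 105.35, 624.00)
high_low <- code_high_low(co2)
print(high_low)
```

Because the comparison is strict, a state at exactly 100 million metric tons would be coded as 0 under this sketch.
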
Row {data-height=750}
-----------------------------------------------------------------------
### Summary Stats
```{r,cache=TRUE}
summary(df)
```
Visualization 1 {data-icon="fa-signal"}
=====================================
### Response Variables:
#### Total CO2 Emissions (in million metric tons)
```{r,cache=TRUE}
ggplot(df,aes(TotCO2)) + geom_histogram(bins=20, fill="cadetblue") + scale_x_continuous(breaks=seq(0,700,50)) + scale_y_continuous(breaks=seq(0,15,2)) + labs(x = "CO2 Emissions (million metric tons)") + labs(y="Count")
```
Column {data-width=500}
-----------------------------------------------------------------------
This histogram shows TotCO2, the total CO2 emissions of each state. Most states fall between 0 and 200 million metric tons of carbon dioxide.
Visualization 2 {data-icon="fa-signal"}
=====================================
Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables: CO2 Emissions: High (1)/Low(0)
#### Bar Chart
```{r, fig.width=5, fig.height=4.5}
as_tibble(select(df,HighLow) %>%
table()) %>%
ggplot(aes(y=n ,x=HighLow)) + geom_bar(stat="identity", fill="cadetblue") + scale_y_continuous(breaks=seq(0,40,5))
```
Column {data-width=500}
-----------------------------------------------------------------------
##### Box Plot
```{r, fig.width=4.5, fig.height=4.5}
ggplot(df,aes(x= HighLow, y=TotCO2, group=HighLow)) + geom_boxplot()
```
Row
-----------------------------------------------------------------------
We can see that the majority of states are considered "Low Emission" states, meaning that they produce less than 100 million metric tons of carbon dioxide. We can also see that "High Emission" states have a larger range of values and include 2 outliers.
Scatter Plots {data-icon="fa-signal"}
=====================================

Row
-----------------------------------------------------------------------





TotCO2 Analyses {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Predict CO2 Emissions Models
I will be using prediction/estimation modeling techniques to predict the amount of CO2 emissions for a state. The first technique I will use is a Multiple Linear Regression. Then I will run a Decision Tree Model and compare them.
Row
-----------------------------------------------------------------------
### Predict CO2 Emissions Model 1 (M1)
For this model my predictors were: Pop, Coal, NaturalGas, Petroleum, NuclearElectricPower, RenewableEnergy, NaturalGasTrans, PetroleumTrans, & TaxCredit.
Row
-----------------------------------------------------------------------
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
# CoalTrans is left out because it is 0 for every state and cannot be estimated
M1 <- lm(TotCO2 ~ Pop + Coal + NaturalGas + Petroleum + NuclearElectricPower + RenewableEnergy + NaturalGasTrans + PetroleumTrans + TaxCredit, data = df)
summary(M1)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(M1)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(M1)$adj.r.squared,4)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
# summary()$sigma is the residual standard error, reported here as the RMSE
Sig <- round(summary(M1)$sigma, 2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r,include=FALSE, cache=TRUE}
# unsorted coefficient matrix (hidden); the sorted table is produced in the next chunk
summary(M1)$coef
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(M1))[,4])
out <- coef(summary(M1))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
Row
-----------------------------------------------------------------------
### Analysis Summary
After examining this model, NuclearElectricPower and TaxCredit are not significant predictors of CO2 (their p-values are above 0.05), so a pruned version of the model is created by removing predictors that are not significant.
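
The pruning step relies on reading p-values off the coefficient table. A minimal sketch of doing that programmatically is shown here. It runs on synthetic data, not the state emissions data, and `weak_terms` is a hypothetical helper name. The 0.05 cutoff is an assumed significance level.

```r
# Synthetic data for illustration only; not the state emissions data.
set.seed(1)
n <- 50
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + 3 * toy$x2 + rnorm(n)  # x3 has no true effect on y

fit <- lm(y ~ x1 + x2 + x3, data = toy)

# Hypothetical helper: names of non-intercept terms whose p-value exceeds alpha.
weak_terms <- function(model, alpha = 0.05) {
  ctab <- coef(summary(model))          # same matrix used for the regression output table
  p <- ctab[, "Pr(>|t|)"]
  p <- p[names(p) != "(Intercept)"]
  names(p)[p > alpha]
}

candidates <- weak_terms(fit)
print(candidates)
```

A p-value filter only flags candidates for removal. The final choice of predictors can still reflect judgment about which variables make sense for the problem.
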
Row
-----------------------------------------------------------------------
### Predict CO2 Emissions Model 2 (M2)
For this analysis we will use a pruned Multiple Linear Regression Model. I also removed predictors that I didn't think were important in predicting CO2 emissions. The transportation predictors NaturalGasTrans and PetroleumTrans were significant (based on their p-values), so I decided to use the Transportation variable as a predictor because it combines CoalTrans, NaturalGasTrans, & PetroleumTrans. These are the predictors in the final model: Coal, NaturalGas, Petroleum, & Transportation.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
M2 <- lm(TotCO2 ~ Coal + NaturalGas + Petroleum + Transportation, data = df)
summary(M2)
```
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(M2)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(M2)$adj.r.squared,4)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
# summary()$sigma is the residual standard error, reported here as the RMSE
Sig <- round(summary(M2)$sigma, 2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(M2)$coef, digits = 3) #pretty table output
```
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(M2))[,4])
out <- coef(summary(M2))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
Row
-----------------------------------------------------------------------

![Decision Tree Model](DecisionTree3.png)
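
The decision tree appears here only as an image (DecisionTree3.png), so its settings are not recorded in this file. As a rough sketch of how a comparable regression tree could be fit in R, the block below uses the rpart package on synthetic data. The column names, the `minsplit` value, and the data itself are assumptions for illustration, not the model behind the image.

```r
library(rpart)  # recommended package that ships with standard R installations

# Synthetic data for illustration only; not the state emissions data.
set.seed(3)
n <- 50
toy <- data.frame(coal = runif(n, 0, 900), gas = runif(n, 0, 4700))
toy$co2 <- 1 + 0.09 * toy$coal + 0.05 * toy$gas + rnorm(n, sd = 5)

# Regression tree (method = "anova") for a numeric response.
# minsplit is lowered from its default of 20 because there are only 50 rows.
tree <- rpart(co2 ~ coal + gas, data = toy, method = "anova",
              control = rpart.control(minsplit = 10))

# In-sample predictions and RMSE, for comparison with the regression models.
pred <- predict(tree, newdata = toy)
tree_rmse <- sqrt(mean((toy$co2 - pred)^2))
print(tree)
print(tree_rmse)
```

A tree predicts one constant value per leaf, so with a response that is close to linear in its predictors it will usually fit less tightly than a linear regression.
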
Row
-----------------------------------------------------------------------
### TotCO2 Model Comparison
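
One way to line up the two regression models is to put their adjusted R-squared and error measures side by side. The sketch below does this on synthetic data with a hypothetical helper, `fit_stats`. It also computes the in-sample RMSE directly, since `summary(model)$sigma` is the residual standard error (it divides by the residual degrees of freedom rather than by n) and is therefore slightly larger than the RMSE.

```r
# Synthetic data for illustration only; not the state emissions data.
set.seed(2)
n <- 50
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + 3 * toy$x2 + rnorm(n)

full   <- lm(y ~ x1 + x2 + x3, data = toy)  # stands in for the larger model
pruned <- lm(y ~ x1 + x2, data = toy)       # stands in for the pruned model

# Hypothetical helper: one row of fit statistics per model.
fit_stats <- function(model) {
  s <- summary(model)
  data.frame(
    n_predictors = length(coef(model)) - 1,
    adj_r2       = s$adj.r.squared,
    sigma        = s$sigma,                        # residual standard error
    rmse         = sqrt(mean(residuals(model)^2))  # in-sample RMSE
  )
}

comparison <- rbind(fit_stats(full), fit_stats(pruned))
rownames(comparison) <- c("full", "pruned")
print(round(comparison, 4))
```

The same helper could be called on M1 and M2 once they are fit. With no validation set, all of these figures describe in-sample fit only.
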

HighLow Analyses
=====================================
### Predict High/Low Emissions Models
I will be using classification modeling techniques to predict if a state would be considered a "high emissions" or "low emissions" state. The first classification technique I will use is a Nominal Logistic Model. Then I will run a Boosted Tree Model and compare the two.
### Nominal Logistic Model (M4)

![Nominal Logistic Model](Logistic3.png)
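
The logistic model is shown only as an image (Logistic3.png). As a hedged sketch of how a binary high/low classifier could be fit and scored in R, the block below uses base `glm()` with a binomial family rather than caret, on synthetic data. The 0.5 probability cutoff and all variable names are assumptions for illustration.

```r
# Synthetic data for illustration only; not the state emissions data.
set.seed(4)
n <- 50
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Binary response drawn from a logistic model so the two classes overlap.
toy$high <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * toy$x1 + 1.0 * toy$x2))

# Logistic regression via glm() with a binomial family.
logit_fit <- glm(high ~ x1 + x2, data = toy, family = binomial)

# Classify as 1 when the fitted probability exceeds 0.5 (an assumed cutoff).
prob <- predict(logit_fit, type = "response")
pred <- as.integer(prob > 0.5)

# Confusion matrix and in-sample accuracy (no validation set is held out).
conf_mat <- table(predicted = factor(pred, levels = 0:1),
                  actual    = factor(toy$high, levels = 0:1))
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Because HighLow is derived from TotCO2, a model on the real data may separate the classes almost perfectly, in which case `glm()` can warn about fitted probabilities of 0 or 1.
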
### Boosted Tree Model (M5)

### HighLow Model Comparison

Conclusion
=====================================
### Questions & Answers:
1. What is the best model to predict a state's carbon dioxide emissions?
A multiple linear regression is the best model.
2. What variables can I use to predict a state's carbon dioxide emissions?
The predictors used were: Coal, NaturalGas, Petroleum, & Transportation.
3. What is the best model to use to predict whether a state would be considered a "high emissions" or "low emissions" state based on my current variables?
The best model to use is a nominal logistic model but a boosted tree model also works very well.
4. What are the variables that can be used in the model?
The variables used were: Coal, NaturalGas, Petroleum, Pop, & Transportation.
Overall, all of the models created were useful and could predict/classify a state's CO2 emissions well. The best models were the Multiple Linear Regression Model 2 and the Nominal Logistic Regression Model. The Multiple Linear Regression Model 1 was also good, but was more complicated. The Boosted Tree Model was almost the same as the Nominal Logistic Model. One thing that surprised me was that I didn't need/use all of the variables I began with.